Modeling Uncertainty in Duplicate Elimination

Authors

  • George Beskales
  • Mohamed A. Soliman
  • Ihab F. Ilyas
Abstract

Real-world databases suffer from data quality problems with many causes, including heterogeneity of consolidated data sources, imprecision of reading devices, and data entry errors. The existence of duplicate records is a prominent data quality problem. The process of duplicate elimination often involves uncertainty in deciding which records are true duplicates. Current tools resolve such uncertainty either through expert intervention, which is not always possible, or by making destructive decisions that may lead to unrecoverable errors. In this paper, we approach duplicate elimination from a new perspective, treating deduplication procedures as data processing tasks with uncertain outcomes. We propose a complete uncertainty model that compactly encodes the space of clean instances of the input data, and introduce efficient model implementations. We extend our model to capture the behavior of the deduplication process, and allow revising and updating the modeled uncertainty. We apply our model and techniques to state-of-the-art deduplication algorithms to demonstrate the added value of our methods. Our experimental study evaluates the complexity and scalability of our techniques in different configurations.

Similar resources

A Flexible Fuzzy Expert System for Fuzzy Duplicate Elimination in Data Cleaning

Data cleaning deals with the detection and removal of errors and inconsistencies in data, gathered from distributed sources. This process is essential for drawing correct conclusions from data in decision support systems. Eliminating fuzzy duplicate records is a fundamental part of the data cleaning process. The vagueness and uncertainty involved in detecting fuzzy duplicates make it a niche, f...


Modeling and Querying Possible Repairs in Duplicate Detection

One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. Furthermore, replacing the inp...


Modeling Critical Flow through Choke for a Gas-condensate Reservoir Based on Drill Stem Test Data

Gas-condensate reservoirs contain hydrocarbon fluids with characteristics between oil and gas reservoirs and a high gas-liquid ratio. Due to the large gas-liquid ratio, wellhead choke calculations using the empirical equations such as Gilbert may contain considerable error. In this study, using drill stem test (DST) data of a gas-condensate reservoir, coefficients of Gilbert equation was modifi...


A knowledge-based approach for duplicate elimination in data cleaning

Existing duplicate elimination methods for data cleaning work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision can be achieved analogously at the cost of lower recall. This is the recall–precision dilemma. We...


Duplicate Detection of Records in Queries Using Clustering

The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. Often, the same logical real-world entity may have multiple representations in the data warehouse. Duplicate elimination is hard because it is caused by several types of errors like typographical errors, and different representations o...


Publication date: 2008